1 Introduction

In this workshop, we will explore the advanced features of the Nextflow language and runtime, and learn how to use them to write efficient and scalable data-intensive workflows.

2 Objectives

3 Tutorial

We will cover topics such as parallel Channels, Processes, and Operators.

3.1 Processes

We now know how to create and use Channels to send data around a workflow. We will now see how to run tasks within a workflow using processes.

A process is the way Nextflow executes commands you would run on the command line or custom scripts.

The syntax is defined as follows:

process < NAME > {
  [ directives ]
  input:
  < process inputs >
  output:
  < process outputs >
  when:
  < condition >
  [script|shell|exec]:
  < user script to be executed >
}

For example:

Example:
bash
cat bin/example_process.nf
#!/usr/bin/env nextflow
params.input = "/home/sinrasu/sinrasu-test/*_{1,2}.fastq.gz"

process FASTQC {

    tag "$meta.id"
    cpus 2
    memory 1.GB

    conda "bioconda::fastqc=0.11.9"
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/fastqc:0.11.9--0' :
        'biocontainers/fastqc:0.12.1--hdfd78af_0' }"

    input:
    tuple val(meta), path(reads)

    output:
    tuple val(meta), path("*.html"), emit: html
    tuple val(meta), path("*.zip") , emit: zip

    script:
    """
    fastqc $reads
    """
}

workflow {
    sample_ch = Channel.fromFilePairs(params.input, checkIfExists: true)
    sample_ch.view()
    FASTQC(sample_ch)
}
Output:
bash
nextflow run bin/example_collect.nf
## N E X T F L O W  ~  version 23.04.1
## Launching `bin/example_collect.nf` [silly_ride] DSL2 - revision: 0eef816596
## ['SRR6357076', [/home/sinrasu/nextflow-workshop/data/reads/SRR6357076_1.fastq.gz, /home/sinrasu/nextflow-workshop/data/reads/SRR6357076_2.fastq.gz], 'SRR6357071', [/home/sinrasu/nextflow-workshop/data/reads/SRR6357071_1.fastq.gz, /home/sinrasu/nextflow-workshop/data/reads/SRR6357071_2.fastq.gz], 'SRR6357070', [/home/sinrasu/nextflow-workshop/data/reads/SRR6357070_1.fastq.gz, /home/sinrasu/nextflow-workshop/data/reads/SRR6357070_2.fastq.gz], 'SRR6357072', [/home/sinrasu/nextflow-workshop/data/reads/SRR6357072_1.fastq.gz, /home/sinrasu/nextflow-workshop/data/reads/SRR6357072_2.fastq.gz]]

3.1.1 Directives

Directive declarations allow the definition of optional settings that affect the execution of the current process without affecting the semantic of the task itself.

Directives are commonly used to define the amount of computing resources to be used or other meta directives that allow the definition of extra configuration of logging information.

List of directives

Name Description
cpus Allows you to define the number of (logical) CPUs required by the process’ task.
time Allows you to define how long the task is allowed to run (e.g., time 1h: 1 hour, 1s 1 second, 1m 1 minute, 1d 1 day).
memory Allows you to define how much memory the task is allowed to use (e.g., 2 GB is 2 GB). Can also use B, KB,MB,GB and TB.
disk Allows you to define how much local disk storage the task is allowed to use.
tag Allows you to associate each process execution with a custom label to make it easier to identify them in the log file or the trace execution report.
publishDir Allows you to save important, non-intermediary, and/or final files in a results folder.

3.2 Operators

In this chapter, we take a curated tour of the Nextflow operators. Commonly used and well understood operators are not covered here - only those that we’ve seen could use more attention or those where the usage could be more elaborate. These set of operators have been chosen to illustrate tangential concepts and Nextflow features.

3.2.1 flatten()

The flatten operator transforms a channel in such a way that every tuple is flattened so that each entry is emitted as a sole element by the resulting channel.

Example:
bash
cat bin/example_flatten.nf
input_ch = Channel.from(["path/file1.fastq", "path/file2.fastq"], ["path/file3.fastq", "path/file4.fastq"], ["path/file5.fastq", "path/file6.fastq"])

input_ch
    .flatten()
    .view()
Output:
bash
nextflow run bin/example_flatten.nf
## N E X T F L O W  ~  version 23.04.1
## Launching `bin/example_flatten.nf` [high_bardeen] DSL2 - revision: d714f8acfd
## path/file1.fastq
## path/file2.fastq
## path/file3.fastq
## path/file4.fastq
## path/file5.fastq
## path/file6.fastq

3.2.2 collect()

The collect operator collects all of the items emitted by a channel in a list and returns the object as a sole emission.

Example:
bash
cat bin/example_collect.nf
workflow {
  sample_ch = Channel.fromFilePairs("./data/reads/*_{1,2}.fastq.gz", checkIfExists:true)
                .collect()
  sample_ch.view()
}
Output:
bash
nextflow run bin/example_collect.nf
## N E X T F L O W  ~  version 23.04.1
## Launching `bin/example_collect.nf` [shrivelled_bohr] DSL2 - revision: 0eef816596
## ['SRR6357076', [/home/sinrasu/nextflow-workshop/data/reads/SRR6357076_1.fastq.gz, /home/sinrasu/nextflow-workshop/data/reads/SRR6357076_2.fastq.gz], 'SRR6357071', [/home/sinrasu/nextflow-workshop/data/reads/SRR6357071_1.fastq.gz, /home/sinrasu/nextflow-workshop/data/reads/SRR6357071_2.fastq.gz], 'SRR6357070', [/home/sinrasu/nextflow-workshop/data/reads/SRR6357070_1.fastq.gz, /home/sinrasu/nextflow-workshop/data/reads/SRR6357070_2.fastq.gz], 'SRR6357072', [/home/sinrasu/nextflow-workshop/data/reads/SRR6357072_1.fastq.gz, /home/sinrasu/nextflow-workshop/data/reads/SRR6357072_2.fastq.gz]]

3.2.3 groupTuple()

The groupTuple operator collects tuples (or lists) of values emitted by the source channel, grouping the elements that share the same key. Finally, it emits a new tuple object for each distinct key collected.

Example:
bash
cat bin/example_groupTuple.nf
ch = channel
     .of( ['wt','wt_1.fq'], ['wt','wt_2.fq'], ["mut",'mut_1.fq'], ['mut', 'mut_2.fq'] )
     .groupTuple()
     .view()
Output:
bash
nextflow run bin/example_groupTuple.nf
## N E X T F L O W  ~  version 23.04.1
## Launching `bin/example_groupTuple.nf` [stoic_tuckerman] DSL2 - revision: 14dfed2105
## [wt, [wt_1.fq, wt_2.fq]]
## [mut, [mut_1.fq, mut_2.fq]]

3.2.4 branch()

The branch operator allows you to forward the items emitted by a source channel to one or more output channels.

The selection criterion is defined by specifying a closure that provides one or more boolean expressions, each of which is identified by a unique label. For the first expression that evaluates to a true value, the item is bound to a named channel as the label identifier. For example:

bash
cat bin/example_branch.nf
workflow {
    params.input = "data/samplesheet.csv"

    Channel.fromPath(params.input)
      .splitCsv(header: true)
      .map{ row -> [[sample:row.sample,strand:row.type],[row.fastq_1,row.fastq_2]]}
      .branch{ meta, reads ->
        single: meta.strand == "single"
        paired: meta.strand == "paired"
      }
      .set { samples }
    samples.single.view()
}
Output:
bash
nextflow run bin/example_branch.nf
## N E X T F L O W  ~  version 23.04.1
## Launching `bin/example_branch.nf` [jovial_heyrovsky] DSL2 - revision: 8b1f242980
## [[sample:WT_REP1, strand:single], [https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357070_1.fastq.gz, https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357070_2.fastq.gz]]
## [[sample:WT_REP1, strand:single], [https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357071_1.fastq.gz, https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357071_2.fastq.gz]]
## [[sample:WT_REP2, strand:single], [https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357072_1.fastq.gz, https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357072_2.fastq.gz]]
## [[sample:RAP1_IAA_30M_REP1, strand:single], [https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357076_1.fastq.gz, https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357076_2.fastq.gz]]

3.2.5 splitCsv()

A common Nextflow pattern is for a simple samplesheet to be passed as primary input into a workflow. We’ll see some more complicated ways to manage these inputs later on in the workshop, but the splitCsv (docs) is an excellent tool to have in a pinch. This operator will parse a csv/tsv and return a channel where each item is a row in the csv/tsv:

bash
cat bin/example_splitCsv.nf
workflow {
    params.input = "https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/samplesheet/v3.10/samplesheet_test.csv"

    Channel.fromPath(params.input)
      .splitCsv(header: true)
      .view()
}

Output:

bash
nextflow run bin/example_splitCsv.nf
## N E X T F L O W  ~  version 23.04.1
## Launching `bin/example_splitCsv.nf` [cheesy_leibniz] DSL2 - revision: 5650611e54
## [sample:WT_REP1, fastq_1:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357070_1.fastq.gz, fastq_2:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357070_2.fastq.gz, strandedness:auto]
## [sample:WT_REP1, fastq_1:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357071_1.fastq.gz, fastq_2:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357071_2.fastq.gz, strandedness:auto]
## [sample:WT_REP2, fastq_1:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357072_1.fastq.gz, fastq_2:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357072_2.fastq.gz, strandedness:reverse]
## [sample:RAP1_UNINDUCED_REP1, fastq_1:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357073_1.fastq.gz, fastq_2:, strandedness:reverse]
## [sample:RAP1_UNINDUCED_REP2, fastq_1:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357074_1.fastq.gz, fastq_2:, strandedness:reverse]
## [sample:RAP1_UNINDUCED_REP2, fastq_1:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357075_1.fastq.gz, fastq_2:, strandedness:reverse]
## [sample:RAP1_IAA_30M_REP1, fastq_1:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357076_1.fastq.gz, fastq_2:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357076_2.fastq.gz, strandedness:reverse]

3.2.6 multiMap()

The multiMap operator is a way of taking a single input channel and emitting into multiple channels for each input element.

Example:

bash
cat bin/example_multiMap.nf
Channel.of( 1, 2, 3, 4, 5 )
    .multiMap {
        small: it
        large: it * 10
    }
    .set { numbers }

numbers.small | view { num -> "Small: $num"}
numbers.large | view { num -> "Large: $num"}
Output:
bash
nextflow run bin/example_multiMap.nf
## N E X T F L O W  ~  version 23.04.1
## Launching `bin/example_multiMap.nf` [confident_cray] DSL2 - revision: 3d406c0ad2
## Large: 10
## Large: 20
## Large: 30
## Large: 40
## Large: 50
## Small: 1
## Small: 2
## Small: 3
## Small: 4
## Small: 5

4 Exercises

The exercises below are designed to strengthen your knowledge in Nextflow more. The solution to each problem is blurred, only after attempting to solve the problem yourself should you look at the solution. Should you need any help, please ask one of the instructors.

  1. Create a Nextflow module for samtools to generate an output structure in the directory “results,” with a separate sub-directory for each given sample ID, and within each sub-directory, the results needs to be there

Your Nextflow module should include the following:

  • Input parameters for specifying the input data (e.g., aligned BAM files).

  • script should contain any two functions from samtools.

  • setup the process to run on minimal no.of cpus and tag the process with sample ids

  1. Given sample sheet contain multiple entries for same sample and mixed of single and paired end, create separate channel for single and multiple entries.
Output
               sample
1             WT_REP1
2             WT_REP1
3             WT_REP2
4 RAP1_UNINDUCED_REP1
5 RAP1_UNINDUCED_REP2
6 RAP1_UNINDUCED_REP2
7   RAP1_IAA_30M_REP1
                                                                                                  fastq_1
1 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357070_1.fastq.gz
2 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357071_1.fastq.gz
3 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357072_1.fastq.gz
4 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357073_1.fastq.gz
5 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357074_1.fastq.gz
6 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357075_1.fastq.gz
7 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357076_1.fastq.gz
                                                                                                  fastq_2
1 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357070_2.fastq.gz
2 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357071_2.fastq.gz
3 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357072_2.fastq.gz
4                                                                                                        
5                                                                                                        
6                                                                                                        
7 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357076_2.fastq.gz
  strandedness
1         auto
2         auto
3      reverse
4      reverse
5      reverse
6      reverse
7      reverse